We present a Brook streaming language implementation of the 3-D discontinuous Galerkin 
method for compressible fluid flow on tetrahedral meshes. Efficient implementation of the 
discontinuous Galerkin method using the streaming model of computation introduces several 
algorithmic design challenges. Using a cycle-accurate simulator, performance characteristics 
have been obtained for the Stanford Merrimac stream processor. The current Merrimac design 
achieves 128 Gflops per chip and the desktop board is populated with 16 chips yielding a peak 
performance of 2 Teraflops. Total parts cost for the desktop board is less than $20K. Current 
cycle-accurate simulations for discretizations of the 3-D compressible flow equations yield 
approximately 40-50% of the peak performance of the Merrimac streaming processor chip. 
Ongoing work includes assessing the performance of the same algorithm on the 2 
Teraflop desktop board, with a target of 1 Teraflop sustained performance. 
A DG Finite Element Method for Conservation Laws 


StreamFEM implements the Discontinuous Galerkin (DG) finite element method 
for systems of nonlinear conservation laws in divergence form in 2-D or 3-D: 
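
In generic notation, a system of conservation laws in divergence form can be written as below. The symbols are assumptions chosen for illustration, not taken from the poster: u is the vector of m conserved variables, F(u) is the flux, and d is the space dimension.

```latex
\frac{\partial u}{\partial t} + \nabla \cdot F(u) = 0,
\qquad u(x,t) \in \mathbb{R}^{m}, \quad x \in \Omega \subset \mathbb{R}^{d}, \quad d = 2 \text{ or } 3 .
```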
The DG Finite Element Variational Statement 
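
A common textbook form of the semi-discrete DG statement is sketched here as an assumption; it may differ in detail from the statement StreamFEM uses. Find u_h in the piecewise polynomial space V_h such that, for every element K and every test function v_h in V_h, the following holds, where h is a numerical flux evaluated from the interior state u_h^- and the neighboring state u_h^+ across a face with outward normal n:

```latex
\int_{K} v_h \, \frac{\partial u_h}{\partial t} \, dx
\;-\; \int_{K} \nabla v_h \cdot F(u_h) \, dx
\;+\; \int_{\partial K} v_h \, h(u_h^{-}, u_h^{+}; n) \, ds \;=\; 0 .
```

In this form the volume integral is the interior term and the boundary integral collects the flux terms exchanged between neighboring elements.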
Variable Arithmetic Intensity 


StreamFEM includes discontinuous Galerkin models of several representative 
nonlinear partial differential equation (PDE) systems of increasing complexity in 3-D: 


• Scalar Advection (1 PDE) 
• Euler Equations (5 PDEs) 
• Magnetohydrodynamics (8 PDEs) 


Figure: Brook functional simulation of flow over a 2-D forward-facing step. 

Figure: StreamFEM-2D hardware-simulated performance. 


StreamFEM also includes several piecewise polynomial representations with 
an increasing number of degrees of freedom (dofs), ranging from piecewise 
constant to piecewise cubic polynomial approximation in 3-D: 

• Piecewise constant elements (1 dof / (element-equation) ) 
• Piecewise linear elements (4 dofs / (element-equation) ) 
• Piecewise quadratic elements (10 dofs / (element-equation) ) 
• Piecewise cubic elements (20 dofs / (element-equation) ) 

By increasing the number of PDEs and the number of degrees of freedom per element, 
it is possible to alter the overall arithmetic intensity of the computation by 10x or more. 
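
The dof counts listed above match the dimension of the space of polynomials of total degree p in 3-D, which is C(p+3, 3). The short Python sketch below reproduces those counts and multiplies by the number of PDEs to give the number of unknowns stored per element. The function and dictionary names are illustrative assumptions and are not part of StreamFEM.

```python
from math import comb

# Number of PDEs in each model, as listed on the poster.
PDE_COUNTS = {"scalar advection": 1, "Euler": 5, "MHD": 8}


def dofs_per_element_equation(degree, dim=3):
    """Dimension of the polynomial space of total degree <= `degree` in `dim` dimensions."""
    return comb(degree + dim, dim)


def unknowns_per_element(n_pdes, degree, dim=3):
    """Unknowns stored per element: one set of dofs for each PDE in the system."""
    return n_pdes * dofs_per_element_equation(degree, dim)


if __name__ == "__main__":
    for name, n_pdes in PDE_COUNTS.items():
        counts = [unknowns_per_element(n_pdes, p) for p in range(4)]
        print(name, counts)
```

Under this count, the state per element grows from 1 value (scalar advection, piecewise constant) to 160 values (magnetohydrodynamics, piecewise cubic). State size is only a rough proxy for arithmetic intensity, which also depends on the cost of evaluating the fluxes.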

StreamFEM Flow Chart 

StreamFEM has been implemented in the Brook stream language and later translated 
into StreamC/KernelC. The current implementation uses a simple Runge-Kutta(1) time-stepping scheme: 



For each timestep: 

  Loop over edges: 
    • Gather 2 element states 
    • Compute flux terms 
    • Store fluxes to memory 

  Loop over elements: 
    • Gather 3 flux terms 
    • Compute interior term and update element 
    • Store updated element 
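
The two-pass structure of the flow chart can be illustrated with a minimal Python sketch. It rests on several assumptions and is not StreamFEM code: scalar advection with piecewise constant elements, a 1-D periodic mesh (so each element gathers 2 flux terms rather than 3), an upwind numerical flux, and a single forward-Euler stage per step. The function name `step` and its parameters are hypothetical.

```python
import numpy as np


def step(u, a, dx, dt):
    """One forward-Euler step for u_t + a u_x = 0 on a periodic 1-D mesh (assumed setup)."""
    n = u.size
    elems = np.arange(n)

    # Edge connectivity: edge e lies between element e-1 (left) and element e (right).
    left = (elems - 1) % n
    right = elems

    # Pass 1, loop over edges: gather 2 element states, compute the flux, store it.
    u_left, u_right = u[left], u[right]
    flux = a * np.where(a >= 0.0, u_left, u_right)  # upwind numerical flux

    # Pass 2, loop over elements: gather the flux terms on each element's faces.
    flux_in = flux[elems]             # face on the element's left side
    flux_out = flux[(elems + 1) % n]  # face on the element's right side

    # The interior (volume) term vanishes for piecewise constant elements,
    # so the update uses only the face fluxes.
    return u - dt / dx * (flux_out - flux_in)


if __name__ == "__main__":
    u0 = np.array([0.0, 1.0, 2.0, 3.0])
    print(step(u0, a=1.0, dx=0.5, dt=0.25))
```

Storing the fluxes between the two passes means each flux is computed once per edge and then reused by both neighboring elements, at the cost of an extra write and gather through memory.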
Future Directions 

• Optimized simulations of StreamFEM-3D (in progress) 
• Multiple node performance simulation and optimization 
• Streaming language implementation of sparse linear algebra kernels 





